Use case example¶
This page shows one complete workflow from login to production execution.
Prerequisites¶
Before starting:
- account created: Account creation
- SSH access configured: Connection and file transfer
- partition basics known: Slurm (quick guide)
Goal¶
Run train.py first in an interactive GPU session, then in batch mode on prod10.
1) Connect and create a project folder¶
ssh dgx
# if you do not use an SSH alias:
# ssh <username>@hubia-dgx.centralesupelec.fr
mkdir -p ~/my_project
cd ~/my_project
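The `ssh dgx` shortcut above assumes an alias in your SSH client configuration. A minimal sketch of such an entry, assuming the host address from the comment above (the alias name `dgx` is your choice, and `<username>` is a placeholder):

```
Host dgx
    HostName hubia-dgx.centralesupelec.fr
    User <username>
```

With this in `~/.ssh/config`, `ssh dgx` connects without typing the full address each time.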
2) Create a Python virtual environment¶
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install numpy torch
Quick check:
python -c "import numpy; print(numpy.__version__)"
3) Test on GPU with an interactive session¶
srun -p interactive10 --time=00:30:00 --pty bash
Inside the interactive shell:
cd ~/my_project
source venv/bin/activate
python train.py
exit
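The steps above assume a `train.py` already exists in `~/my_project`. It is whatever training code you bring; as a hedged placeholder, a toy script that uses the GPU when one is visible (and falls back to CPU otherwise) could look like this:

```python
# train.py -- toy stand-in: fits y = 2x + 1 with a one-parameter linear model
import torch

# Use the GPU if Slurm gave us one, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"training on: {device}")

# Synthetic data: a noisy line
x = torch.linspace(-1, 1, 256, device=device).unsqueeze(1)
y = 2 * x + 1 + 0.05 * torch.randn_like(x)

model = torch.nn.Linear(1, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

# Full-batch gradient descent
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
```

Printing the device at startup is a cheap sanity check: in the interactive GPU session you should see `cuda`, not `cpu`.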
4) Prepare a batch script¶
Use the default template:
cd ~/my_project
cp ~/slurm-prod10.sbatch ./job.sbatch
nano job.sbatch
Set at least:
- job name (#SBATCH --job-name=...)
- time limit (#SBATCH --time=...)
- the Python command (python ...)
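The exact contents of the provided template may differ; as a hedged sketch only, an edited job.sbatch for this workflow could look like the following (the --gres line and time value are illustrative assumptions, not taken from the template):

```bash
#!/bin/bash
#SBATCH --job-name=my_project
#SBATCH --partition=prod10
#SBATCH --gres=gpu:1          # assumption: one GPU slice is requested this way
#SBATCH --time=01:00:00

cd ~/my_project
source venv/bin/activate
python train.py
```

Keep whatever directives the template already sets; only add or change the lines listed above.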
5) Submit and monitor¶
sbatch job.sbatch
squeue -u $USER
Inspect one job:
scontrol show job <jobid>
sacct -j <jobid> --format=JobID,State,Elapsed,ExitCode
Read logs:
tail -n 100 slurm-<jobid>.out
Cancel if needed:
scancel <jobid>
6) Scale up only when needed¶
If the model does not fit in prod10 (10 GB VRAM), move to a larger partition:
- prod40 (40 GB VRAM)
- prod80 (80 GB VRAM)
Keep the same workflow; only the partition, the time limit, and the script content change.
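In practice, scaling up is typically a one-line change in the batch script (assuming the #SBATCH conventions shown earlier on this page):

```bash
#SBATCH --partition=prod40    # was: prod10
```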
7) Optional: work from VS Code¶
You can use VS Code Remote-SSH to edit files on the DGX.
If the extension gets stuck, run "Remote-SSH: Uninstall VS Code Server from Host" from the VS Code command palette, then reconnect to the host.
Next references¶
- Slurm command reference: Slurm jobs management
- Advanced scheduling policy: Advanced partitions
- GPU/MIG mapping: GPU and MIG layout